Statistical Graphics for High-Dimensional Data

Susan VanderPlas

Outline


  • Introduction
  • Statistical Graphics and Big Data
  • Case Study: Designing Interactive Graphics for Soybean Population Genetic Analysis

Introduction

About Me

PhD in Statistics (2015*)


MS in Statistics (2011)
BS in Applied Mathematical Sciences and Cognitive Psychology (2009)

Statistical Interests

  • Modeling
  • Bayesian statistics
  • Data mining
  • Visualization
  • Simulation
  • Nonparametric statistics
  • Engineering statistics

Research and Collaborations

  • Modeling material structure and composition

    Robust nonparametric statistics for Atom Probe Tomography Spectra

    (MS research, with ISU Materials Science & Engineering)
  • Evaluating road safety (with Iowa Dept. of Transportation)
  • Exploring perception of statistical graphics (PhD research)
  • Analysis of soybean genomics (with USDA)

Statistical Philosophy

  • Understand the dataset through exploratory analysis
    • Graphical summaries
    • Summary statistics
    • Identify record errors, data artifacts, other issues that may affect modeling
  • Model the data appropriately
  • Communicate model results and implications clearly
    • Well-designed graphics
    • Simulated model predictions to make model less abstract

Statistical Graphics and Big Data

Good Statistical Graphics

Function:

  • Show the data
  • Don’t distort the data

Form:

  • Show a consistent story
  • Provide several levels of detail (ideally)

Elegance:
How do I best communicate the data?

  • Perceptual Awareness
  • Visual Bandwidth (information overload)

Big Data

Visualization is an important tool for working with big data

Graphical adaptations for big data:

  • Overplotting (large \(n\))
  • High-dimensional data (large \(p\))
  • Distributed/multi-source data, hierarchical data
  • No solution (binning, dimension reduction, interactive tour) works for every situation
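Binning, one of the adaptations named above, replaces n individual points with counts on a fixed grid, so the drawing cost depends on the number of cells rather than on n. The following is a minimal sketch using only the Python standard library. The function name `bin_points_2d`, the grid size, and the simulated data are assumptions made for illustration and are not taken from the soybean project.

```python
import random
from collections import Counter


def bin_points_2d(points, x_range, y_range, nx, ny):
    """Count points falling in each cell of an nx-by-ny rectangular grid.

    Returns a Counter mapping (column, row) -> count; empty cells are absent.
    Points outside the given ranges are ignored.
    """
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    x_width = (x_max - x_min) / nx
    y_width = (y_max - y_min) / ny
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Clamp so that points on the upper edge land in the last cell.
        col = min(int((x - x_min) / x_width), nx - 1)
        row = min(int((y - y_min) / y_width), ny - 1)
        counts[(col, row)] += 1
    return counts


# Simulated data (hypothetical): 100,000 points from a bivariate normal.
rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100_000)]

cells = bin_points_2d(points, (-4, 4), (-4, 4), nx=40, ny=40)
# At most 40 * 40 = 1,600 tiles need to be drawn instead of 100,000 points.
print(len(points), "points ->", len(cells), "non-empty cells")
```

The cell counts can then be mapped to fill color in a tile plot. As the slide notes, this is not a universal fix: the grid resolution is a tuning choice, and fine structure within a cell is lost.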

Interactive Graphics

  • Provide additional information in response to user action

  • Simultaneously show more than 2-3 variables and their relationship (multiple linked plots)

  • Accommodate complex data structures

BUT…

Web-based interactive graphics may be even more size-sensitive than static graphics.

Interactive Visualization of Soybean Population Genetic Data

Soybean Project: People and Institutions

Overall Project Goals:

  • Understand historical yield increases
    100% increase in the past 100 years; a further 70% increase needed by 2050 to meet food needs (World Bank)
  • Associate genetic features with phenotypic traits
    Disease resistance, yield, nutritional content, time to maturity

  • Communicate analysis results intuitively:
    • Target: Soybean farmers, plant geneticists
    • Provide full results (tables) and graphical summaries
    • Interface with existing databases and web resources

Data


  • Sequencing Data (79 varieties, 75GB processed and compressed)

  • Field Trials (168 varieties, 30 varieties with genetic data)

  • New crosses with highest yield varieties
    (sequencing + field trials)

  • Genealogy as reported in the breeding literature (1600 varieties)

Visualizing SNPs:

  • Huge number of interesting genes (70 million ID’d SNPs)
  • 79 varieties, 20 chromosomes
  • Phenotype and genealogy information
  • Researchers tend to work on gene subsets:
    Must be able to zoom and filter
  • Optimized files for SNP results are still large (10 GB) and require significant computational resources

Above all, need an interface to allow people to pull new discoveries from the data systematically.
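The zoom-and-filter requirement can be illustrated with a small sketch: given SNP records, keep only those in a chosen chromosome region and, optionally, a chosen set of varieties. This is an assumption-laden toy in Python, not the applet's implementation. The `Snp` record layout, the `filter_region` function, and the variety names are all hypothetical.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    chromosome: int  # 1..20 for soybean
    position: int    # basepair position on the chromosome
    variety: str     # variety label (hypothetical names below)
    ref: str         # reference base
    alt: str         # observed base


def filter_region(snps, chromosome, start, end, varieties=None):
    """Return SNPs on one chromosome with start <= position <= end.

    If `varieties` is given, keep only records for those varieties.
    """
    keep = set(varieties) if varieties is not None else None
    return [
        s for s in snps
        if s.chromosome == chromosome
        and start <= s.position <= end
        and (keep is None or s.variety in keep)
    ]


# Toy records (made up for illustration, not real soybean data).
snps = [
    Snp(1, 1_200, "VarietyA", "A", "T"),
    Snp(1, 58_000, "VarietyB", "G", "A"),
    Snp(1, 61_500, "VarietyA", "C", "G"),
    Snp(2, 60_000, "VarietyA", "A", "T"),
]

region = filter_region(snps, chromosome=1, start=50_000, end=70_000)
only_a = filter_region(snps, 1, 50_000, 70_000, varieties=["VarietyA"])
print(len(region), len(only_a))
```

A linear scan like this would not scale to a 10 GB results file. A real interface would more plausibly query an indexed store, but the filter arguments (chromosome, position range, varieties) would look much the same.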

Visualizing SNPs

  • SNP: Single Nucleotide Polymorphism, a single basepair mutation
    (A -> T, G -> A, C -> G)
  • Shiny applet: Responsive applet for user-directed data subsets
  • Show multiple levels of detail (less detail = lower computational load)
  • Provide resources in the applet for user exploration (not just a reference tool)
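The idea that less detail means lower computational load can be sketched as window counts at two resolutions: a coarse summary for the chromosome-level overview and a finer one for a zoomed-in view. This Python sketch is illustrative only. The function name `snp_density`, the window sizes, and the positions are assumptions, not values from the applet.

```python
from collections import Counter


def snp_density(positions, window):
    """Count SNPs in consecutive windows of `window` basepairs.

    Returns a dict mapping window start position -> number of SNPs,
    sorted by window start. Windows with no SNPs are omitted.
    """
    counts = Counter((pos // window) * window for pos in positions)
    return dict(sorted(counts.items()))


# Toy basepair positions on one chromosome (made up for illustration).
positions = [120, 450, 980, 1_020, 1_700, 5_300, 5_310, 9_999]

coarse = snp_density(positions, window=5_000)  # overview level
fine = snp_density(positions, window=1_000)    # zoomed-in level
print(coarse)
print(fine)
```

The coarse summary has fewer values to transfer and draw, which matters for a web-based display. Individual SNPs would only be fetched once the user has narrowed the region of interest.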

Applet Design

SNP Population Distribution

SNP Applet Overview

Density of SNPs: Chromosome Level

SNP Density

Individual SNPs: Comparing Varieties

Variety-Level SNP Browser

Genealogy and Phenotypes

SNP Linked Plots

Interactive Plot Design

Conclusions

  • Design of graphics informs our ability to work with data
  • Well-designed graphics facilitate further exploration of the data
  • High-dimensional data may require interactive graphics

Other Projects

  • Animint: create web-ready interactive graphics and dashboards within R, using ggplot2 and d3.js

  • Dissertation Research
    • Illusions affecting perception of variability in statistical plots
      (2014 ASA Student Paper Award)
    • Reading statistical graphics: what visual skills are required?
    • Effect of graphical features (color, shape) on ability to identify “significant” graphs
  • Consulting
    • Shiny applets and dashboards for interactive data display
    • Statistics for power plant reliability

  • Web scraping and data aggregation
    • Craigslist ads
    • OkCupid
    • Location-based energy prices
    • Welder economics: features, utility, and prices

Summary

  • Visualization research is inherently interdisciplinary
  • Statistical graphics makes unique contributions to visualizing large data sets
  • Statistical graphics are important for communicating statistical results to non-statisticians

Acknowledgements

Computation

  • dplyr/plyr
  • reshape2/tidyr
  • cn.mops: CNV (copy number variation) identification in populations of genetic data

Visualization Software

  • ggplot2
  • Animint
    d3 interactive web graphics using ggplot2 syntax in R
  • Shiny (RStudio) interactive web applets
  • Reveal.js (slides) with R Markdown and knitr